Summarizing Noisy Documents
Abstract
We investigate the problem of summarizing text documents that contain errors as a result of optical character recognition. Each stage in the process is tested, the error effects analyzed, and possible solutions suggested. Our experimental results show that current approaches, which are developed to deal with clean text, suffer significant degradation even with slight increases in the noise level of a document. We conclude by proposing possible ways of improving the performance of noisy document summarization.
Similar resources
Summarization Of Noisy Documents: A Pilot Study
We investigate the problem of summarizing text documents that contain errors as a result of optical character recognition. Each stage in the process is tested, the error effects analyzed, and possible solutions suggested. Our experimental results show that current approaches, which are developed to deal with clean text, suffer significant degradation even with slight increases in the noise leve...
Performance Evaluation of Quantitative Metrics on Ancient Text Documents Using Migt
Optical Character Recognition (OCR) currently has a wide variety of applications in text document image analysis for recognizing individual characters of any language. Digitizing old documents so that their essence is preserved for future eras is a difficult task. In this paper we summarize different quantitative image metrics for estimating the loss of in...
Biogeography-Based Optimization Algorithm for Automatic Extractive Text Summarization
Given the increasing number of documents, sites, online sources, and the users’ desire to quickly access information, automatic textual summarization has caught the attention of many researchers in this field. Researchers have presented different methods for text summarization as well as a useful summary of those texts including relevant document sentences. This study select...
Automatic Text Summarization in Engineering Information Management
In today’s knowledge-intensive engineering environment, information management is an important and essential activity. However, existing research on Engineering Information Management (EIM) has mainly focused on numerical data such as computer models and process data. Textual data, especially free text, which constitute a significant part of engineering information, have been somewha...
Supervised Machine Learning for Summarizing Legal Documents
This paper presents a supervised machine learning approach for summarizing legal documents. A commercial system for the analysis and summarization of legal documents provided us with a corpus of almost 4,000 text and extract pairs for our machine learning experiments. That corpus was pre-processed to identify the selected source sentences in extracts from which we generated legal structured dat...